Data analysis in Python with pandas

What is pandas?

pandas: Open source library in Python for data analysis, data manipulation, and data visualisation.

Pros:

  1. Tons of functionality
  2. Well supported by community
  3. Active development
  4. Lot of documentation
  5. Plays well with other packages, for e.g NumPy, Scikit-learn

In [37]:
import pandas as pd

How do I read a tabular data file into pandas?

Tabular data file: By default tab separated file (tsv)


In [7]:
orders = pd.read_table('http://bit.ly/chiporders')

In [9]:
orders.head()


Out[9]:
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98

In [17]:
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('http://bit.ly/movieusers', delimiter='|', header=None, names=user_cols)

In [18]:
users.head()


Out[18]:
user_id age gender occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213

Tip: skiprows and skipfooter params are useful to omit extra data in a file heading or ending.

How do I select a pandas Series from a DataFrame?

Two basic data structures in pandas

  1. DataFrame: Table with rows and columns
  2. Series: Each columns is known as pandas Series

In [20]:
ufo = pd.read_csv('http://bit.ly/uforeports')

In [21]:
ufo.head()


Out[21]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

In [22]:
type(ufo)


Out[22]:
pandas.core.frame.DataFrame

In [24]:
type(ufo['City'])


Out[24]:
pandas.core.series.Series

In [26]:
city = ufo.City

In [27]:
city.head()


Out[27]:
0                  Ithaca
1             Willingboro
2                 Holyoke
3                 Abilene
4    New York Worlds Fair
Name: City, dtype: object

Tip: Create a new Series in a DataFrame


In [28]:
ufo['location'] = ufo.City + ', ' + ufo.State

In [29]:
ufo.head()


Out[29]:
City Colors Reported Shape Reported State Time location
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00 Ithaca, NY
1 Willingboro NaN OTHER NJ 6/30/1930 20:00 Willingboro, NJ
2 Holyoke NaN OVAL CO 2/15/1931 14:00 Holyoke, CO
3 Abilene NaN DISK KS 6/1/1931 13:00 Abilene, KS
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00 New York Worlds Fair, NY

Why do some pandas commands end with parentheses, and other commands don't?


In [31]:
movies = pd.read_csv('http://bit.ly/imdbratings')

In [32]:
movies.head()


Out[32]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

In [33]:
movies.describe()


Out[33]:
star_rating duration
count 979.000000 979.000000
mean 7.889785 120.979571
std 0.336069 26.218010
min 7.400000 64.000000
25% 7.600000 102.000000
50% 7.800000 117.000000
75% 8.100000 134.000000
max 9.300000 242.000000

In [34]:
movies.shape


Out[34]:
(979, 6)

In [36]:
movies.dtypes


Out[36]:
star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

In [37]:
type(movies)


Out[37]:
pandas.core.frame.DataFrame

In [38]:
movies.describe(include=['object'])


Out[38]:
title content_rating genre actors_list
count 979 976 979 979
unique 975 12 16 969
top The Girl with the Dragon Tattoo R Drama [u'Daniel Radcliffe', u'Emma Watson', u'Rupert...
freq 2 460 278 6

Tip: Hit "Shift+Tab" inside a method parentheses to get list of arguments

How to rename columns in pandas DataFrame?


In [39]:
ufo.head()


Out[39]:
City Colors Reported Shape Reported State Time location
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00 Ithaca, NY
1 Willingboro NaN OTHER NJ 6/30/1930 20:00 Willingboro, NJ
2 Holyoke NaN OVAL CO 2/15/1931 14:00 Holyoke, CO
3 Abilene NaN DISK KS 6/1/1931 13:00 Abilene, KS
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00 New York Worlds Fair, NY

In [40]:
ufo.columns


Out[40]:
Index([u'City', u'Colors Reported', u'Shape Reported', u'State', u'Time',
       u'location'],
      dtype='object')

In [42]:
ufo.rename(columns={'Colors Reported': 'Colors_Reported', 'Shape Reported': 'Shape_Reported'}, inplace=True)

In [43]:
ufo.columns


Out[43]:
Index([u'City', u'Colors_Reported', u'Shape_Reported', u'State', u'Time',
       u'location'],
      dtype='object')

In [44]:
ufo_cols = ['city', 'colors reported', 'state reported', 'state', 'time', 'location']
ufo.columns = ufo_cols

In [45]:
ufo.head()


Out[45]:
city colors reported state reported state time location
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00 Ithaca, NY
1 Willingboro NaN OTHER NJ 6/30/1930 20:00 Willingboro, NJ
2 Holyoke NaN OVAL CO 2/15/1931 14:00 Holyoke, CO
3 Abilene NaN DISK KS 6/1/1931 13:00 Abilene, KS
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00 New York Worlds Fair, NY

In [50]:
ufo_cols = ['City', 'Colors Reported', 'State Reported', 'State', 'Time']
ufo = pd.read_csv('http://bit.ly/uforeports', names=ufo_cols, header=0)

In [51]:
ufo.head()


Out[51]:
City Colors Reported State Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

Tip: Use str.replace method to drop the space from columns names


In [53]:
ufo.columns = ufo.columns.str.replace(' ', '_')

In [54]:
ufo.columns


Out[54]:
Index([u'City', u'Colors_Reported', u'State_Reported', u'State', u'Time'], dtype='object')

How do I remove columns from a pandas DataFrame?


In [56]:
ufo = pd.read_csv('http://bit.ly/uforeports')

In [57]:
ufo.head()


Out[57]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

In [58]:
ufo.shape


Out[58]:
(18241, 5)

In [59]:
ufo.drop('Colors Reported', axis=1, inplace=True)

In [60]:
ufo.head()


Out[60]:
City Shape Reported State Time
0 Ithaca TRIANGLE NY 6/1/1930 22:00
1 Willingboro OTHER NJ 6/30/1930 20:00
2 Holyoke OVAL CO 2/15/1931 14:00
3 Abilene DISK KS 6/1/1931 13:00
4 New York Worlds Fair LIGHT NY 4/18/1933 19:00

In [61]:
ufo.drop(labels=['City', 'State'], axis=1, inplace=True)

In [62]:
ufo.head()


Out[62]:
Shape Reported Time
0 TRIANGLE 6/1/1930 22:00
1 OTHER 6/30/1930 20:00
2 OVAL 2/15/1931 14:00
3 DISK 6/1/1931 13:00
4 LIGHT 4/18/1933 19:00

Tip: To remove rows instead of columns, choose axis=0


In [63]:
ufo.drop([0, 1], axis=0, inplace=True)

In [65]:
ufo.shape


Out[65]:
(18239, 2)

How do I sort a pandas DataFrame or Series?


In [67]:
movies = pd.read_csv('http://bit.ly/imdbratings')

In [68]:
movies.head()


Out[68]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

In [69]:
movies.title.sort_values()


Out[69]:
542                   (500) Days of Summer
5                             12 Angry Men
201                       12 Years a Slave
698                              127 Hours
110                  2001: A Space Odyssey
910                                   2046
596                               21 Grams
624                              25th Hour
708                       28 Days Later...
60                                3 Idiots
225                                 3-Iron
570                                    300
555                           3:10 to Yuma
427           4 Months, 3 Weeks and 2 Days
824                                     42
597                                  50/50
203                                  8 1/2
170                       A Beautiful Mind
941                       A Bridge Too Far
571                           A Bronx Tale
266                      A Christmas Story
86                      A Clockwork Orange
716                         A Few Good Men
750                    A Fish Called Wanda
276                   A Fistful of Dollars
612                     A Hard Day's Night
883                  A History of Violence
869              A Nightmare on Elm Street
865                        A Perfect World
426                              A Prophet
                      ...                 
207       What Ever Happened to Baby Jane?
562            What's Eating Gilbert Grape
719                When Harry Met Sally...
649                      Where Eagles Dare
33                                Whiplash
669                Who Framed Roger Rabbit
219        Who's Afraid of Virginia Woolf?
127                      Wild Strawberries
497    Willy Wonka & the Chocolate Factory
270                        Wings of Desire
483                           Withnail & I
920                                Witness
65             Witness for the Prosecution
970                            Wonder Boys
518                         Wreck-It Ralph
954                                  X-Men
248             X-Men: Days of Future Past
532                     X-Men: First Class
871                                     X2
695                      Y Tu Mama Tambien
403                             Ying xiong
235                                Yip Man
96                                 Yojimbo
280                     Young Frankenstein
535                                  Zelig
955                       Zero Dark Thirty
677                                 Zodiac
615                             Zombieland
526                                   Zulu
864                                  [Rec]
Name: title, dtype: object

In [70]:
movies.sort_values('title')


Out[70]:
star_rating title content_rating genre duration actors_list
542 7.8 (500) Days of Summer PG-13 Comedy 95 [u'Zooey Deschanel', u'Joseph Gordon-Levitt', ...
5 8.9 12 Angry Men NOT RATED Drama 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
201 8.1 12 Years a Slave R Biography 134 [u'Chiwetel Ejiofor', u'Michael Kenneth Willia...
698 7.6 127 Hours R Adventure 94 [u'James Franco', u'Amber Tamblyn', u'Kate Mara']
110 8.3 2001: A Space Odyssey G Mystery 160 [u'Keir Dullea', u'Gary Lockwood', u'William S...
910 7.5 2046 R Drama 129 [u'Tony Chiu Wai Leung', u'Ziyi Zhang', u'Faye...
596 7.7 21 Grams R Crime 124 [u'Sean Penn', u'Benicio Del Toro', u'Naomi Wa...
624 7.7 25th Hour R Crime 135 [u'Edward Norton', u'Barry Pepper', u'Philip S...
708 7.6 28 Days Later... R Horror 113 [u'Cillian Murphy', u'Naomie Harris', u'Christ...
60 8.5 3 Idiots PG-13 Comedy 170 [u'Aamir Khan', u'Madhavan', u'Mona Singh']
225 8.1 3-Iron R Crime 88 [u'Seung-yeon Lee', u'Hyun-kyoon Lee', u'Hyuk-...
570 7.8 300 R Action 117 [u'Gerard Butler', u'Lena Headey', u'David Wen...
555 7.8 3:10 to Yuma R Adventure 122 [u'Russell Crowe', u'Christian Bale', u'Ben Fo...
427 7.9 4 Months, 3 Weeks and 2 Days NOT RATED Drama 113 [u'Anamaria Marinca', u'Vlad Ivanov', u'Laura ...
824 7.5 42 PG-13 Biography 128 [u'Chadwick Boseman', u'T.R. Knight', u'Harris...
597 7.7 50/50 R Comedy 100 [u'Joseph Gordon-Levitt', u'Seth Rogen', u'Ann...
203 8.1 8 1/2 NOT RATED Drama 138 [u'Marcello Mastroianni', u'Anouk Aim\xe9e', u...
170 8.2 A Beautiful Mind PG-13 Biography 135 [u'Russell Crowe', u'Ed Harris', u'Jennifer Co...
941 7.4 A Bridge Too Far PG Drama 175 [u'Sean Connery', u"Ryan O'Neal", u'Michael Ca...
571 7.8 A Bronx Tale R Crime 121 [u'Robert De Niro', u'Chazz Palminteri', u'Lil...
266 8.1 A Christmas Story PG Comedy 94 [u'Peter Billingsley', u'Melinda Dillon', u'Da...
86 8.4 A Clockwork Orange X Crime 136 [u'Malcolm McDowell', u'Patrick Magee', u'Mich...
716 7.6 A Few Good Men R Drama 138 [u'Tom Cruise', u'Jack Nicholson', u'Demi Moore']
750 7.6 A Fish Called Wanda R Comedy 108 [u'John Cleese', u'Jamie Lee Curtis', u'Kevin ...
276 8.1 A Fistful of Dollars R Action 99 [u'Clint Eastwood', u'Gian Maria Volont\xe9', ...
612 7.7 A Hard Day's Night APPROVED Comedy 87 [u'John Lennon', u'Paul McCartney', u'George H...
883 7.5 A History of Violence R Crime 96 [u'Viggo Mortensen', u'Maria Bello', u'Ed Harr...
869 7.5 A Nightmare on Elm Street R Horror 91 [u'Heather Langenkamp', u'Johnny Depp', u'Robe...
865 7.5 A Perfect World PG-13 Crime 138 [u'Kevin Costner', u'Clint Eastwood', u'Laura ...
426 7.9 A Prophet R Crime 155 [u'Tahar Rahim', u'Niels Arestrup', u'Adel Ben...
... ... ... ... ... ... ...
207 8.1 What Ever Happened to Baby Jane? APPROVED Drama 134 [u'Bette Davis', u'Joan Crawford', u'Victor Bu...
562 7.8 What's Eating Gilbert Grape PG-13 Drama 118 [u'Johnny Depp', u'Leonardo DiCaprio', u'Julie...
719 7.6 When Harry Met Sally... R Comedy 96 [u'Billy Crystal', u'Meg Ryan', u'Carrie Fisher']
649 7.7 Where Eagles Dare NaN Action 158 [u'Richard Burton', u'Clint Eastwood', u'Mary ...
33 8.6 Whiplash R Drama 107 [u'Miles Teller', u'J.K. Simmons', u'Melissa B...
669 7.7 Who Framed Roger Rabbit PG Animation 104 [u'Bob Hoskins', u'Christopher Lloyd', u'Joann...
219 8.1 Who's Afraid of Virginia Woolf? TV-MA Drama 131 [u'Elizabeth Taylor', u'Richard Burton', u'Geo...
127 8.3 Wild Strawberries UNRATED Drama 91 [u'Victor Sj\xf6str\xf6m', u'Bibi Andersson', ...
497 7.8 Willy Wonka & the Chocolate Factory G Family 100 [u'Gene Wilder', u'Jack Albertson', u'Peter Os...
270 8.1 Wings of Desire PG-13 Drama 128 [u'Bruno Ganz', u'Solveig Dommartin', u'Otto S...
483 7.8 Withnail & I R Comedy 107 [u'Richard E. Grant', u'Paul McGann', u'Richar...
920 7.5 Witness R Crime 112 [u'Harrison Ford', u'Kelly McGillis', u'Lukas ...
65 8.4 Witness for the Prosecution APPROVED Crime 116 [u'Tyrone Power', u'Marlene Dietrich', u'Charl...
970 7.4 Wonder Boys R Drama 107 [u'Michael Douglas', u'Tobey Maguire', u'Franc...
518 7.8 Wreck-It Ralph PG Animation 101 [u'John C. Reilly', u'Jack McBrayer', u'Jane L...
954 7.4 X-Men PG-13 Action 104 [u'Patrick Stewart', u'Hugh Jackman', u'Ian Mc...
248 8.1 X-Men: Days of Future Past PG-13 Action 131 [u'Patrick Stewart', u'Ian McKellen', u'Hugh J...
532 7.8 X-Men: First Class PG-13 Action 132 [u'James McAvoy', u'Michael Fassbender', u'Jen...
871 7.5 X2 PG-13 Action 134 [u'Patrick Stewart', u'Hugh Jackman', u'Halle ...
695 7.7 Y Tu Mama Tambien R Drama 106 [u'Maribel Verd\xfa', u'Gael Garc\xeda Bernal'...
403 7.9 Ying xiong PG-13 Action 99 [u'Jet Li', u'Tony Chiu Wai Leung', u'Maggie C...
235 8.1 Yip Man R Action 106 [u'Donnie Yen', u'Simon Yam', u'Siu-Wong Fan']
96 8.4 Yojimbo UNRATED Action 110 [u'Toshir\xf4 Mifune', u'Eijir\xf4 T\xf4no', u...
280 8.1 Young Frankenstein PG Comedy 106 [u'Gene Wilder', u'Madeline Kahn', u'Marty Fel...
535 7.8 Zelig PG Comedy 79 [u'Woody Allen', u'Mia Farrow', u'Patrick Horg...
955 7.4 Zero Dark Thirty R Drama 157 [u'Jessica Chastain', u'Joel Edgerton', u'Chri...
677 7.7 Zodiac R Crime 157 [u'Jake Gyllenhaal', u'Robert Downey Jr.', u'M...
615 7.7 Zombieland R Comedy 88 [u'Jesse Eisenberg', u'Emma Stone', u'Woody Ha...
526 7.8 Zulu UNRATED Drama 138 [u'Stanley Baker', u'Jack Hawkins', u'Ulla Jac...
864 7.5 [Rec] R Horror 78 [u'Manuela Velasco', u'Ferran Terraza', u'Jorg...

979 rows × 6 columns

Tip: Sort by multiple columns


In [71]:
movies.sort_values(['content_rating', 'duration'])


Out[71]:
star_rating title content_rating genre duration actors_list
713 7.6 The Jungle Book APPROVED Animation 78 [u'Phil Harris', u'Sebastian Cabot', u'Louis P...
513 7.8 Invasion of the Body Snatchers APPROVED Horror 80 [u'Kevin McCarthy', u'Dana Wynter', u'Larry Ga...
272 8.1 The Killing APPROVED Crime 85 [u'Sterling Hayden', u'Coleen Gray', u'Vince E...
703 7.6 Dracula APPROVED Horror 85 [u'Bela Lugosi', u'Helen Chandler', u'David Ma...
612 7.7 A Hard Day's Night APPROVED Comedy 87 [u'John Lennon', u'Paul McCartney', u'George H...
58 8.5 Paths of Glory APPROVED Drama 88 [u'Kirk Douglas', u'Ralph Meeker', u'Adolphe M...
210 8.1 Laura APPROVED Film-Noir 88 [u'Gene Tierney', u'Dana Andrews', u'Clifton W...
656 7.7 Snow White and the Seven Dwarfs APPROVED Animation 88 [u'Adriana Caselotti', u'Harry Stockwell', u'L...
844 7.5 Pinocchio APPROVED Animation 88 [u'Dickie Jones', u'Christian Rub', u'Mel Blanc']
233 8.1 The Night of the Hunter APPROVED Crime 92 [u'Robert Mitchum', u'Shelley Winters', u'Lill...
287 8.1 His Girl Friday APPROVED Comedy 92 [u'Cary Grant', u'Rosalind Russell', u'Ralph B...
484 7.8 The Day the Earth Stood Still APPROVED Sci-Fi 92 [u'Michael Rennie', u'Patricia Neal', u'Hugh M...
451 7.9 Miracle on 34th Street APPROVED Comedy 96 [u'Edmund Gwenn', u"Maureen O'Hara", u'John Pa...
229 8.1 Strangers on a Train APPROVED Crime 101 [u'Farley Granger', u'Robert Walker', u'Ruth R...
237 8.1 Notorious APPROVED Drama 101 [u'Cary Grant', u'Ingrid Bergman', u'Claude Ra...
521 7.8 Belle de Jour APPROVED Drama 101 [u'Catherine Deneuve', u'Jean Sorel', u'Michel...
285 8.1 Bringing Up Baby APPROVED Comedy 102 [u'Katharine Hepburn', u'Cary Grant', u'Charle...
94 8.4 Singin' in the Rain APPROVED Comedy 103 [u'Gene Kelly', u"Donald O'Connor", u'Debbie R...
575 7.8 Kramer vs. Kramer APPROVED Drama 105 [u'Dustin Hoffman', u'Meryl Streep', u'Jane Al...
298 8.0 The Graduate APPROVED Comedy 106 [u'Dustin Hoffman', u'Anne Bancroft', u'Kathar...
866 7.5 To Catch a Thief APPROVED Mystery 106 [u'Cary Grant', u'Grace Kelly', u'Jessie Royce...
230 8.1 Cat on a Hot Tin Roof APPROVED Drama 108 [u'Elizabeth Taylor', u'Paul Newman', u'Burl I...
350 8.0 Shadow of a Doubt APPROVED Thriller 108 [u'Teresa Wright', u'Joseph Cotten', u'Macdona...
329 8.0 In the Heat of the Night APPROVED Crime 109 [u'Sidney Poitier', u'Rod Steiger', u'Warren O...
563 7.8 Goldfinger APPROVED Action 110 [u'Sean Connery', u'Gert Fr\xf6be', u'Honor Bl...
410 7.9 Bonnie and Clyde APPROVED Biography 111 [u'Warren Beatty', u'Faye Dunaway', u'Michael ...
38 8.6 Rear Window APPROVED Mystery 112 [u'James Stewart', u'Grace Kelly', u'Wendell C...
253 8.1 The Big Sleep APPROVED Crime 114 [u'Humphrey Bogart', u'Lauren Bacall', u'John ...
896 7.5 From Russia with Love APPROVED Action 115 [u'Sean Connery', u'Robert Shaw', u'Lotte Lenya']
65 8.4 Witness for the Prosecution APPROVED Crime 116 [u'Tyrone Power', u'Marlene Dietrich', u'Charl...
... ... ... ... ... ... ...
652 7.7 This Is England UNRATED Crime 101 [u'Thomas Turgoose', u'Stephen Graham', u'Jo H...
144 8.2 It Happened One Night UNRATED Comedy 105 [u'Clark Gable', u'Claudette Colbert', u'Walte...
291 8.1 The Return UNRATED Drama 105 [u'Vladimir Garin', u'Ivan Dobronravov', u'Kon...
523 7.8 Repulsion UNRATED Drama 105 [u'Catherine Deneuve', u'Ian Hendry', u'John F...
123 8.3 The General UNRATED Action 107 [u'Buster Keaton', u'Marion Mack', u'Glen Cave...
472 7.8 Guess Who's Coming to Dinner UNRATED Comedy 108 [u'Spencer Tracy', u'Sidney Poitier', u'Kathar...
757 7.6 Funny Games UNRATED Crime 108 [u'Susanne Lothar', u'Ulrich M\xfche', u'Arno ...
96 8.4 Yojimbo UNRATED Action 110 [u'Toshir\xf4 Mifune', u'Eijir\xf4 T\xf4no', u...
687 7.7 Spellbound UNRATED Film-Noir 111 [u'Ingrid Bergman', u'Gregory Peck', u'Michael...
582 7.8 Sabrina UNRATED Comedy 113 [u'Humphrey Bogart', u'Audrey Hepburn', u'Will...
823 7.5 Straw Dogs UNRATED Crime 113 [u'Dustin Hoffman', u'Susan George', u'Peter V...
239 8.1 Elite Squad: The Enemy Within UNRATED Action 115 [u'Wagner Moura', u'Irandhir Santos', u'Andr\x...
161 8.2 Diabolique UNRATED Horror 116 [u'Simone Signoret', u'V\xe9ra Clouzot', u'Pau...
317 8.0 Dawn of the Dead UNRATED Horror 127 [u'David Emge', u'Ken Foree', u'Scott H. Reini...
802 7.6 Tell No One UNRATED Crime 131 [u'Fran\xe7ois Cluzet', u'Marie-Jos\xe9e Croze...
217 8.1 Memories of Murder UNRATED Crime 132 [u'Kang-ho Song', u'Sang-kyung Kim', u'Roe-ha ...
262 8.1 The Hustler UNRATED Drama 134 [u'Paul Newman', u'Jackie Gleason', u'Piper La...
289 8.1 All Quiet on the Western Front UNRATED Drama 136 [u'Lew Ayres', u'Louis Wolheim', u'John Wray']
526 7.8 Zulu UNRATED Drama 138 [u'Stanley Baker', u'Jack Hawkins', u'Ulla Jac...
730 7.6 West Side Story UNRATED Crime 152 [u'Natalie Wood', u'George Chakiris', u'Richar...
257 8.1 Anatomy of a Murder UNRATED Crime 160 [u'James Stewart', u'Lee Remick', u'Ben Gazzara']
120 8.3 The Great Escape UNRATED Adventure 172 [u'Steve McQueen', u'James Garner', u'Richard ...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
486 7.8 Evil Dead II X Comedy 84 [u'Bruce Campbell', u'Sarah Berry', u'Dan Hicks']
913 7.5 Suspiria X Horror 92 [u'Jessica Harper', u'Stefania Casini', u'Flav...
387 8.0 Midnight Cowboy X Drama 113 [u'Dustin Hoffman', u'Jon Voight', u'Sylvia Mi...
86 8.4 A Clockwork Orange X Crime 136 [u'Malcolm McDowell', u'Patrick Magee', u'Mich...
187 8.2 Butch Cassidy and the Sundance Kid NaN Biography 110 [u'Paul Newman', u'Robert Redford', u'Katharin...
936 7.4 True Grit NaN Adventure 128 [u'John Wayne', u'Kim Darby', u'Glen Campbell']
649 7.7 Where Eagles Dare NaN Action 158 [u'Richard Burton', u'Clint Eastwood', u'Mary ...

979 rows × 6 columns

How do I filter rows of a pandas DataFrame by column value?


In [72]:
movies = pd.read_csv('http://bit.ly/imdbratings')

In [73]:
movies.head()


Out[73]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

In [74]:
movies.shape


Out[74]:
(979, 6)

In [78]:
movies[movies.duration >= 200]


Out[78]:
star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78 8.4 Once Upon a Time in America R Crime 229 [u'Robert De Niro', u'James Woods', u'Elizabet...
85 8.4 Lawrence of Arabia PG Adventure 216 [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142 8.3 Lagaan: Once Upon a Time in India PG Adventure 224 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204 8.1 Ben-Hur G Adventure 212 [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445 7.9 The Ten Commandments APPROVED Adventure 220 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...
630 7.7 Malcolm X PG-13 Biography 202 [u'Denzel Washington', u'Angela Bassett', u'De...
767 7.6 It's a Mad, Mad, Mad, Mad World APPROVED Action 205 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

Tip: Filtered DataFrame is also a DataFrame


In [79]:
movies.loc[movies.duration >= 200, 'genre']


Out[79]:
2          Crime
7      Adventure
17         Drama
78         Crime
85     Adventure
142    Adventure
157        Drama
204    Adventure
445    Adventure
476        Drama
630    Biography
767       Action
Name: genre, dtype: object

How do I apply multiple filter criteria to a pandas DataFrame?


In [80]:
movies = pd.read_csv('http://bit.ly/imdbratings')

In [81]:
movies.head()


Out[81]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

In [82]:
movies[movies.duration >= 200]


Out[82]:
star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78 8.4 Once Upon a Time in America R Crime 229 [u'Robert De Niro', u'James Woods', u'Elizabet...
85 8.4 Lawrence of Arabia PG Adventure 216 [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142 8.3 Lagaan: Once Upon a Time in India PG Adventure 224 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204 8.1 Ben-Hur G Adventure 212 [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445 7.9 The Ten Commandments APPROVED Adventure 220 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...
630 7.7 Malcolm X PG-13 Biography 202 [u'Denzel Washington', u'Angela Bassett', u'De...
767 7.6 It's a Mad, Mad, Mad, Mad World APPROVED Action 205 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

In [88]:
movies[(movies.duration >= 200) & (movies.genre == 'Drama')]


Out[88]:
star_rating title content_rating genre duration actors_list
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...

Tip: Multiple condition on a single column


In [89]:
movies[movies.genre.isin(['Crime', 'Drama', 'Action'])]


Out[89]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....
5 8.9 12 Angry Men NOT RATED Drama 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
9 8.9 Fight Club R Drama 139 [u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
11 8.8 Inception PG-13 Action 148 [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...
12 8.8 Star Wars: Episode V - The Empire Strikes Back PG Action 124 [u'Mark Hamill', u'Harrison Ford', u'Carrie Fi...
13 8.8 Forrest Gump PG-13 Drama 142 [u'Tom Hanks', u'Robin Wright', u'Gary Sinise']
16 8.7 One Flew Over the Cuckoo's Nest R Drama 133 [u'Jack Nicholson', u'Louise Fletcher', u'Mich...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
19 8.7 Star Wars PG Action 121 [u'Mark Hamill', u'Harrison Ford', u'Carrie Fi...
20 8.7 The Matrix R Action 136 [u'Keanu Reeves', u'Laurence Fishburne', u'Car...
21 8.7 City of God R Crime 130 [u'Alexandre Rodrigues', u'Matheus Nachtergael...
22 8.7 It's a Wonderful Life APPROVED Drama 130 [u'James Stewart', u'Donna Reed', u'Lionel Bar...
23 8.7 The Usual Suspects R Crime 106 [u'Kevin Spacey', u'Gabriel Byrne', u'Chazz Pa...
24 8.7 Se7en R Drama 127 [u'Morgan Freeman', u'Brad Pitt', u'Kevin Spac...
27 8.6 The Silence of the Lambs R Drama 118 [u'Jodie Foster', u'Anthony Hopkins', u'Lawren...
28 8.6 Leon: The Professional R Crime 110 [u'Jean Reno', u'Gary Oldman', u'Natalie Portm...
32 8.6 Casablanca PG Drama 102 [u'Humphrey Bogart', u'Ingrid Bergman', u'Paul...
33 8.6 Whiplash R Drama 107 [u'Miles Teller', u'J.K. Simmons', u'Melissa B...
34 8.6 American History X R Crime 119 [u'Edward Norton', u'Edward Furlong', u"Beverl...
36 8.6 Saving Private Ryan R Action 169 [u'Tom Hanks', u'Matt Damon', u'Tom Sizemore']
37 8.6 Raiders of the Lost Ark PG Action 115 [u'Harrison Ford', u'Karen Allen', u'Paul Free...
40 8.5 The Green Mile R Crime 189 [u'Tom Hanks', u'Michael Clarke Duncan', u'Dav...
41 8.5 Sunset Blvd. NOT RATED Drama 110 [u'William Holden', u'Gloria Swanson', u'Erich...
43 8.5 The Dark Knight Rises PG-13 Action 165 [u'Christian Bale', u'Tom Hardy', u'Anne Hatha...
44 8.5 Gladiator R Action 155 [u'Russell Crowe', u'Joaquin Phoenix', u'Conni...
45 8.5 Terminator 2: Judgment Day R Action 137 [u'Arnold Schwarzenegger', u'Linda Hamilton', ...
... ... ... ... ... ... ...
916 7.5 Up in the Air R Drama 109 [u'George Clooney', u'Vera Farmiga', u'Anna Ke...
917 7.5 Begin Again R Drama 104 [u'Keira Knightley', u'Mark Ruffalo', u'Adam L...
918 7.5 Running Scared R Action 122 [u'Paul Walker', u'Cameron Bright', u'Chazz Pa...
920 7.5 Witness R Crime 112 [u'Harrison Ford', u'Kelly McGillis', u'Lukas ...
922 7.5 Mud PG-13 Drama 130 [u'Matthew McConaughey', u'Tye Sheridan', u'Ja...
923 7.5 Across the Universe PG-13 Drama 133 [u'Evan Rachel Wood', u'Jim Sturgess', u'Joe A...
924 7.5 Les Miserables PG-13 Crime 134 [u'Liam Neeson', u'Geoffrey Rush', u'Uma Thurm...
925 7.5 Notes on a Scandal R Drama 92 [u'Cate Blanchett', u'Judi Dench', u'Andrew Si...
926 7.5 Inside Llewyn Davis R Drama 104 [u'Oscar Isaac', u'Carey Mulligan', u'John Goo...
927 7.5 Brick R Crime 110 [u'Joseph Gordon-Levitt', u'Lukas Haas', u'Emi...
931 7.4 Mean Streets R Crime 112 [u'Robert De Niro', u'Harvey Keitel', u'David ...
939 7.4 Predestination R Drama 97 [u'Ethan Hawke', u'Sarah Snook', u'Noah Taylor']
941 7.4 A Bridge Too Far PG Drama 175 [u'Sean Connery', u"Ryan O'Neal", u'Michael Ca...
945 7.4 Take Shelter R Drama 120 [u'Michael Shannon', u'Jessica Chastain', u'Sh...
946 7.4 Far from Heaven PG-13 Drama 107 [u'Julianne Moore', u'Dennis Quaid', u'Dennis ...
947 7.4 Eraserhead UNRATED Drama 89 [u'Jack Nance', u'Charlotte Stewart', u'Allen ...
950 7.4 Bound R Crime 108 [u'Jennifer Tilly', u'Gina Gershon', u'Joe Pan...
951 7.4 Sleepy Hollow R Drama 105 [u'Johnny Depp', u'Christina Ricci', u'Miranda...
954 7.4 X-Men PG-13 Action 104 [u'Patrick Stewart', u'Hugh Jackman', u'Ian Mc...
955 7.4 Zero Dark Thirty R Drama 157 [u'Jessica Chastain', u'Joel Edgerton', u'Chri...
958 7.4 My Sister's Keeper PG-13 Drama 109 [u'Cameron Diaz', u'Abigail Breslin', u'Alec B...
963 7.4 La Femme Nikita R Action 118 [u'Anne Parillaud', u'Marc Duret', u'Patrick F...
967 7.4 The Rock R Action 136 [u'Sean Connery', u'Nicolas Cage', u'Ed Harris']
968 7.4 The English Patient R Drama 162 [u'Ralph Fiennes', u'Juliette Binoche', u'Will...
969 7.4 Law Abiding Citizen R Crime 109 [u'Gerard Butler', u'Jamie Foxx', u'Leslie Bibb']
970 7.4 Wonder Boys R Drama 107 [u'Michael Douglas', u'Tobey Maguire', u'Franc...
972 7.4 Blue Valentine NC-17 Drama 112 [u'Ryan Gosling', u'Michelle Williams', u'John...
973 7.4 The Cider House Rules PG-13 Drama 126 [u'Tobey Maguire', u'Charlize Theron', u'Micha...
976 7.4 Master and Commander: The Far Side of the World PG-13 Action 138 [u'Russell Crowe', u'Paul Bettany', u'Billy Bo...
978 7.4 Wall Street R Crime 126 [u'Charlie Sheen', u'Michael Douglas', u'Tamar...

538 rows × 6 columns

How do I change the data type of a pandas Series?


In [90]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')

In [92]:
drinks.dtypes


Out[92]:
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [93]:
drinks.beer_servings = drinks.beer_servings.astype(float)

In [94]:
drinks.dtypes


Out[94]:
country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [95]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry', dtype={'beer_servings': float})

In [96]:
drinks.dtypes


Out[96]:
country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [97]:
orders = pd.read_table('http://bit.ly/chiporders')

In [98]:
orders.head()


Out[98]:
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98

In [102]:
orders.dtypes


Out[102]:
order_id                int64
quantity                int64
item_name              object
choice_description     object
item_price            float64
dtype: object

In [100]:
orders.item_price = orders.item_price.str.replace('$', '').astype(float)

In [101]:
orders.head()


Out[101]:
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN 2.39
1 1 1 Izze [Clementine] 3.39
2 1 1 Nantucket Nectar [Apple] 3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN 2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... 16.98

In [105]:
orders.item_name.str.contains('Chicken').astype(int).head()


Out[105]:
0    0
1    0
2    0
3    0
4    1
Name: item_name, dtype: int64

How do I explore a pandas Series?


In [106]:
movies = pd.read_csv('http://bit.ly/imdbratings')

In [107]:
movies.head()


Out[107]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

In [108]:
movies.dtypes


Out[108]:
star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

In [110]:
movies.genre.describe()


Out[110]:
count       979
unique       16
top       Drama
freq        278
Name: genre, dtype: object

In [111]:
movies.genre.value_counts()


Out[111]:
Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Adventure     75
Animation     62
Horror        29
Mystery       16
Western        9
Thriller       5
Sci-Fi         5
Film-Noir      3
Family         2
Fantasy        1
History        1
Name: genre, dtype: int64

In [112]:
movies.genre.value_counts(normalize=True)


Out[112]:
Drama        0.283963
Comedy       0.159346
Action       0.138917
Crime        0.126660
Biography    0.078652
Adventure    0.076609
Animation    0.063330
Horror       0.029622
Mystery      0.016343
Western      0.009193
Thriller     0.005107
Sci-Fi       0.005107
Film-Noir    0.003064
Family       0.002043
Fantasy      0.001021
History      0.001021
Name: genre, dtype: float64

In [113]:
movies.genre.unique()


Out[113]:
array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography',
       'Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi',
       'History', 'Thriller', 'Family', 'Fantasy'], dtype=object)

In [114]:
movies.genre.nunique()


Out[114]:
16

In [116]:
pd.crosstab(movies.genre, movies.content_rating)


Out[116]:
content_rating APPROVED G GP NC-17 NOT RATED PASSED PG PG-13 R TV-MA UNRATED X
genre
Action 3 1 1 0 4 1 11 44 67 0 3 0
Adventure 3 2 0 0 5 1 21 23 17 0 2 0
Animation 3 20 0 0 3 0 25 5 5 0 1 0
Biography 1 2 1 0 1 0 6 29 36 0 0 0
Comedy 9 2 1 1 16 3 23 23 73 0 4 1
Crime 6 0 0 1 7 1 6 4 87 0 11 1
Drama 12 3 0 4 24 1 25 55 143 1 9 1
Family 0 1 0 0 0 0 1 0 0 0 0 0
Fantasy 0 0 0 0 0 0 0 0 1 0 0 0
Film-Noir 1 0 0 0 1 0 0 0 0 0 1 0
History 0 0 0 0 0 0 0 0 0 0 1 0
Horror 2 0 0 1 1 0 1 2 16 0 5 1
Mystery 4 1 0 0 1 0 1 2 6 0 1 0
Sci-Fi 1 0 0 0 0 0 0 1 3 0 0 0
Thriller 1 0 0 0 0 0 1 0 3 0 0 0
Western 1 0 0 0 2 0 2 1 3 0 0 0

In [117]:
movies.duration.describe()


Out[117]:
count    979.000000
mean     120.979571
std       26.218010
min       64.000000
25%      102.000000
50%      117.000000
75%      134.000000
max      242.000000
Name: duration, dtype: float64

In [119]:
movies.duration.mean()


Out[119]:
120.97957099080695

Tip: Visualisation


In [120]:
%matplotlib inline

In [121]:
movies.duration.plot(kind='hist')


Out[121]:
<matplotlib.axes._subplots.AxesSubplot at 0x116aea150>

In [122]:
movies.genre.value_counts().plot(kind='bar')


Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x116b06c90>

How do I handle missing values in pandas?


In [123]:
ufo = pd.read_csv('http://bit.ly/uforeports')

In [124]:
ufo.tail()


Out[124]:
City Colors Reported Shape Reported State Time
18236 Grant Park NaN TRIANGLE IL 12/31/2000 23:00
18237 Spirit Lake NaN DISK IA 12/31/2000 23:00
18238 Eagle River NaN NaN WI 12/31/2000 23:45
18239 Eagle River RED LIGHT WI 12/31/2000 23:45
18240 Ybor NaN OVAL FL 12/31/2000 23:59

In [126]:
ufo.isnull().tail()


Out[126]:
City Colors Reported Shape Reported State Time
18236 False True False False False
18237 False True False False False
18238 False True True False False
18239 False False False False False
18240 False True False False False

In [127]:
ufo.notnull().tail()


Out[127]:
City Colors Reported Shape Reported State Time
18236 True False True True True
18237 True False True True True
18238 True False False True True
18239 True True True True True
18240 True False True True True

In [129]:
ufo.isnull().sum()


Out[129]:
City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64

In [131]:
ufo[ufo.City.isnull()]


Out[131]:
City Colors Reported Shape Reported State Time
21 NaN NaN NaN LA 8/15/1943 0:00
22 NaN NaN LIGHT LA 8/15/1943 0:00
204 NaN NaN DISK CA 7/15/1952 12:30
241 NaN BLUE DISK MT 7/4/1953 14:00
613 NaN NaN DISK NV 7/1/1960 12:00
1877 NaN YELLOW CIRCLE AZ 8/15/1969 1:00
2013 NaN NaN NaN NH 8/1/1970 9:30
2546 NaN NaN FIREBALL OH 10/25/1973 23:30
3123 NaN RED TRIANGLE WV 11/25/1975 23:00
4736 NaN NaN SPHERE CA 6/23/1982 23:00
5269 NaN NaN NaN AZ 6/30/1985 21:30
6735 NaN NaN FORMATION TX 4/1/1992 2:00
7208 NaN NaN CIRCLE MI 10/4/1993 17:30
8828 NaN NaN TRIANGLE WA 10/30/1995 21:30
8967 NaN NaN VARIOUS CA 12/8/1995 18:00
9273 NaN NaN TRIANGLE OH 5/1/1996 3:00
9388 NaN NaN OVAL CA 6/12/1996 12:00
9587 NaN NaN EGG FL 8/24/1996 15:00
10399 NaN NaN TRIANGLE IL 6/15/1997 23:00
11625 NaN NaN CIRCLE TX 6/7/1998 7:00
12441 NaN RED FIREBALL WA 10/26/1998 17:58
15767 NaN NaN RECTANGLE NV 1/21/2000 11:30
15812 NaN NaN LIGHT NV 2/2/2000 3:00
16054 NaN GREEN NaN FL 3/11/2000 3:30
16608 NaN NaN SPHERE NY 6/15/2000 15:00

In [132]:
ufo.shape


Out[132]:
(18241, 5)

In [133]:
ufo.dropna(how='any').shape


Out[133]:
(2486, 5)

In [134]:
ufo.dropna(how='all').shape


Out[134]:
(18241, 5)

In [135]:
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape


Out[135]:
(15576, 5)

In [137]:
ufo['Shape Reported'].value_counts(dropna=False)


Out[137]:
LIGHT        2803
NaN          2644
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
VARIOUS       333
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
ROUND           2
CRESCENT        2
DOME            1
PYRAMID         1
HEXAGON         1
FLARE           1
Name: Shape Reported, dtype: int64

In [138]:
ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)

In [139]:
ufo['Shape Reported'].value_counts()


Out[139]:
VARIOUS      2977
LIGHT        2803
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
ROUND           2
CRESCENT        2
PYRAMID         1
HEXAGON         1
DOME            1
FLARE           1
Name: Shape Reported, dtype: int64

How do I apply a function to a pandas Series or a DataFrame?


In [3]:
train = pd.read_csv('http://bit.ly/kaggletrain')

In [4]:
train.head()


Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [5]:
train['Sex_num'] = train.Sex.map({'female': 0, 'male': 1})

In [6]:
train.loc[0:4, ['Sex', 'Sex_num']]


Out[6]:
Sex Sex_num
0 male 1
1 female 0
2 female 0
3 female 0
4 male 1

In [7]:
train['Name_length'] = train.Name.apply(len)

In [9]:
train.loc[0:4, ['Name', 'Name_length']]


Out[9]:
Name Name_length
0 Braund, Mr. Owen Harris 23
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 51
2 Heikkinen, Miss. Laina 22
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 44
4 Allen, Mr. William Henry 24

In [10]:
import numpy as np

In [11]:
train['Fare_ceil'] = train.Fare.apply(np.ceil)

In [12]:
train.loc[0:4, ['Fare', 'Fare_ceil']]


Out[12]:
Fare Fare_ceil
0 7.2500 8.0
1 71.2833 72.0
2 7.9250 8.0
3 53.1000 54.0
4 8.0500 9.0

In [14]:
train.Name.str.split(',').head()


Out[14]:
0                           [Braund,  Mr. Owen Harris]
1    [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                            [Heikkinen,  Miss. Laina]
3      [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                          [Allen,  Mr. William Henry]
Name: Name, dtype: object

In [15]:
def get_element(my_list, position):
    return my_list[position]

In [16]:
train['Last_name'] = train.Name.str.split(',').apply(get_element, position=0)

In [19]:
train['Last_name'] = train.Name.str.split(',').apply(lambda x: x[0])

In [20]:
train.loc[0:4, ['Name', 'Last_name']]


Out[20]:
Name Last_name
0 Braund, Mr. Owen Harris Braund
1 Cumings, Mrs. John Bradley (Florence Briggs Th... Cumings
2 Heikkinen, Miss. Laina Heikkinen
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) Futrelle
4 Allen, Mr. William Henry Allen

In [22]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.head()


Out[22]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [27]:
drinks.loc[:, 'beer_servings':'wine_servings'].apply(np.argmax, axis=1)


Out[27]:
0        beer_servings
1      spirit_servings
2        beer_servings
3        wine_servings
4        beer_servings
5      spirit_servings
6        wine_servings
7      spirit_servings
8        beer_servings
9        beer_servings
10     spirit_servings
11     spirit_servings
12     spirit_servings
13       beer_servings
14     spirit_servings
15     spirit_servings
16       beer_servings
17       beer_servings
18       beer_servings
19       beer_servings
20       beer_servings
21     spirit_servings
22       beer_servings
23       beer_servings
24       beer_servings
25     spirit_servings
26       beer_servings
27       beer_servings
28       beer_servings
29       beer_servings
            ...       
163    spirit_servings
164      beer_servings
165      wine_servings
166      wine_servings
167    spirit_servings
168    spirit_servings
169    spirit_servings
170      beer_servings
171      wine_servings
172      beer_servings
173      beer_servings
174      beer_servings
175      beer_servings
176      beer_servings
177    spirit_servings
178    spirit_servings
179      beer_servings
180    spirit_servings
181    spirit_servings
182      beer_servings
183      beer_servings
184      beer_servings
185      wine_servings
186    spirit_servings
187      beer_servings
188      beer_servings
189      beer_servings
190      beer_servings
191      beer_servings
192      beer_servings
dtype: object

In [28]:
drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float)


Out[28]:
beer_servings spirit_servings wine_servings
0 0.0 0.0 0.0
1 89.0 132.0 54.0
2 25.0 0.0 14.0
3 245.0 138.0 312.0
4 217.0 57.0 45.0
5 102.0 128.0 45.0
6 193.0 25.0 221.0
7 21.0 179.0 11.0
8 261.0 72.0 212.0
9 279.0 75.0 191.0
10 21.0 46.0 5.0
11 122.0 176.0 51.0
12 42.0 63.0 7.0
13 0.0 0.0 0.0
14 143.0 173.0 36.0
15 142.0 373.0 42.0
16 295.0 84.0 212.0
17 263.0 114.0 8.0
18 34.0 4.0 13.0
19 23.0 0.0 0.0
20 167.0 41.0 8.0
21 76.0 173.0 8.0
22 173.0 35.0 35.0
23 245.0 145.0 16.0
24 31.0 2.0 1.0
25 231.0 252.0 94.0
26 25.0 7.0 7.0
27 88.0 0.0 0.0
28 37.0 1.0 7.0
29 144.0 56.0 16.0
... ... ... ...
163 128.0 178.0 7.0
164 90.0 2.0 2.0
165 152.0 60.0 186.0
166 185.0 100.0 280.0
167 5.0 35.0 16.0
168 2.0 15.0 0.0
169 99.0 258.0 1.0
170 106.0 27.0 86.0
171 1.0 1.0 4.0
172 36.0 2.0 19.0
173 36.0 21.0 5.0
174 197.0 156.0 7.0
175 51.0 3.0 20.0
176 51.0 22.0 7.0
177 19.0 71.0 32.0
178 6.0 41.0 9.0
179 45.0 9.0 0.0
180 206.0 237.0 45.0
181 16.0 135.0 5.0
182 219.0 126.0 195.0
183 36.0 6.0 1.0
184 249.0 158.0 84.0
185 115.0 35.0 220.0
186 25.0 101.0 8.0
187 21.0 18.0 11.0
188 333.0 100.0 3.0
189 111.0 2.0 1.0
190 6.0 0.0 0.0
191 32.0 19.0 4.0
192 64.0 18.0 4.0

193 rows × 3 columns

How do I avoid a SettingWithCopyWarning in pandas?


In [31]:
movies = pd.read_csv('http://bit.ly/imdbratings')
movies.head()


Out[31]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

In [33]:
movies.content_rating.isnull().sum()


Out[33]:
3

In [34]:
movies[movies.content_rating.isnull()]


Out[34]:
star_rating title content_rating genre duration actors_list
187 8.2 Butch Cassidy and the Sundance Kid NaN Biography 110 [u'Paul Newman', u'Robert Redford', u'Katharin...
649 7.7 Where Eagles Dare NaN Action 158 [u'Richard Burton', u'Clint Eastwood', u'Mary ...
936 7.4 True Grit NaN Adventure 128 [u'John Wayne', u'Kim Darby', u'Glen Campbell']

In [35]:
movies.content_rating.value_counts()


Out[35]:
R            460
PG-13        189
PG           123
NOT RATED     65
APPROVED      47
UNRATED       38
G             32
PASSED         7
NC-17          7
X              4
GP             3
TV-MA          1
Name: content_rating, dtype: int64

In [38]:
movies.loc[movies.content_rating == 'NOT RATED', 'content_rating'] = np.nan

In [39]:
movies.content_rating.isnull().sum()


Out[39]:
68

In [41]:
top_movies = movies.loc[movies.star_rating >= 9, :]
top_movies


Out[41]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...

In [42]:
top_movies.loc[0, 'duration'] = 150


/usr/local/lib/python2.7/site-packages/pandas/core/indexing.py:476: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

In [43]:
top_movies


Out[43]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 150 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...

In [44]:
top_movies = movies.loc[movies.star_rating >= 9, :].copy()
top_movies


Out[44]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...

In [46]:
top_movies.loc[0, 'duration'] = 150
top_movies


Out[46]:
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 150 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...

How do I change display options in pandas?


In [48]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks


Out[48]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
5 Antigua & Barbuda 102 128 45 4.9 North America
6 Argentina 193 25 221 8.3 South America
7 Armenia 21 179 11 3.8 Europe
8 Australia 261 72 212 10.4 Oceania
9 Austria 279 75 191 9.7 Europe
10 Azerbaijan 21 46 5 1.3 Europe
11 Bahamas 122 176 51 6.3 North America
12 Bahrain 42 63 7 2.0 Asia
13 Bangladesh 0 0 0 0.0 Asia
14 Barbados 143 173 36 6.3 North America
15 Belarus 142 373 42 14.4 Europe
16 Belgium 295 84 212 10.5 Europe
17 Belize 263 114 8 6.8 North America
18 Benin 34 4 13 1.1 Africa
19 Bhutan 23 0 0 0.4 Asia
20 Bolivia 167 41 8 3.8 South America
21 Bosnia-Herzegovina 76 173 8 4.6 Europe
22 Botswana 173 35 35 5.4 Africa
23 Brazil 245 145 16 7.2 South America
24 Brunei 31 2 1 0.6 Asia
25 Bulgaria 231 252 94 10.3 Europe
26 Burkina Faso 25 7 7 4.3 Africa
27 Burundi 88 0 0 6.3 Africa
28 Cote d'Ivoire 37 1 7 4.0 Africa
29 Cabo Verde 144 56 16 4.0 Africa
... ... ... ... ... ... ...
163 Suriname 128 178 7 5.6 South America
164 Swaziland 90 2 2 4.7 Africa
165 Sweden 152 60 186 7.2 Europe
166 Switzerland 185 100 280 10.2 Europe
167 Syria 5 35 16 1.0 Asia
168 Tajikistan 2 15 0 0.3 Asia
169 Thailand 99 258 1 6.4 Asia
170 Macedonia 106 27 86 3.9 Europe
171 Timor-Leste 1 1 4 0.1 Asia
172 Togo 36 2 19 1.3 Africa
173 Tonga 36 21 5 1.1 Oceania
174 Trinidad & Tobago 197 156 7 6.4 North America
175 Tunisia 51 3 20 1.3 Africa
176 Turkey 51 22 7 1.4 Asia
177 Turkmenistan 19 71 32 2.2 Asia
178 Tuvalu 6 41 9 1.0 Oceania
179 Uganda 45 9 0 8.3 Africa
180 Ukraine 206 237 45 8.9 Europe
181 United Arab Emirates 16 135 5 2.8 Asia
182 United Kingdom 219 126 195 10.4 Europe
183 Tanzania 36 6 1 5.7 Africa
184 USA 249 158 84 8.7 North America
185 Uruguay 115 35 220 6.6 South America
186 Uzbekistan 25 101 8 2.4 Asia
187 Vanuatu 21 18 11 0.9 Oceania
188 Venezuela 333 100 3 7.7 South America
189 Vietnam 111 2 1 2.0 Asia
190 Yemen 6 0 0 0.1 Asia
191 Zambia 32 19 4 2.5 Africa
192 Zimbabwe 64 18 4 4.7 Africa

193 rows × 6 columns


In [49]:
pd.get_option('display.max_rows')


Out[49]:
60

In [50]:
pd.set_option('display.max_rows', None)

In [51]:
drinks


Out[51]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
5 Antigua & Barbuda 102 128 45 4.9 North America
6 Argentina 193 25 221 8.3 South America
7 Armenia 21 179 11 3.8 Europe
8 Australia 261 72 212 10.4 Oceania
9 Austria 279 75 191 9.7 Europe
10 Azerbaijan 21 46 5 1.3 Europe
11 Bahamas 122 176 51 6.3 North America
12 Bahrain 42 63 7 2.0 Asia
13 Bangladesh 0 0 0 0.0 Asia
14 Barbados 143 173 36 6.3 North America
15 Belarus 142 373 42 14.4 Europe
16 Belgium 295 84 212 10.5 Europe
17 Belize 263 114 8 6.8 North America
18 Benin 34 4 13 1.1 Africa
19 Bhutan 23 0 0 0.4 Asia
20 Bolivia 167 41 8 3.8 South America
21 Bosnia-Herzegovina 76 173 8 4.6 Europe
22 Botswana 173 35 35 5.4 Africa
23 Brazil 245 145 16 7.2 South America
24 Brunei 31 2 1 0.6 Asia
25 Bulgaria 231 252 94 10.3 Europe
26 Burkina Faso 25 7 7 4.3 Africa
27 Burundi 88 0 0 6.3 Africa
28 Cote d'Ivoire 37 1 7 4.0 Africa
29 Cabo Verde 144 56 16 4.0 Africa
30 Cambodia 57 65 1 2.2 Asia
31 Cameroon 147 1 4 5.8 Africa
32 Canada 240 122 100 8.2 North America
33 Central African Republic 17 2 1 1.8 Africa
34 Chad 15 1 1 0.4 Africa
35 Chile 130 124 172 7.6 South America
36 China 79 192 8 5.0 Asia
37 Colombia 159 76 3 4.2 South America
38 Comoros 1 3 1 0.1 Africa
39 Congo 76 1 9 1.7 Africa
40 Cook Islands 0 254 74 5.9 Oceania
41 Costa Rica 149 87 11 4.4 North America
42 Croatia 230 87 254 10.2 Europe
43 Cuba 93 137 5 4.2 North America
44 Cyprus 192 154 113 8.2 Europe
45 Czech Republic 361 170 134 11.8 Europe
46 North Korea 0 0 0 0.0 Asia
47 DR Congo 32 3 1 2.3 Africa
48 Denmark 224 81 278 10.4 Europe
49 Djibouti 15 44 3 1.1 Africa
50 Dominica 52 286 26 6.6 North America
51 Dominican Republic 193 147 9 6.2 North America
52 Ecuador 162 74 3 4.2 South America
53 Egypt 6 4 1 0.2 Africa
54 El Salvador 52 69 2 2.2 North America
55 Equatorial Guinea 92 0 233 5.8 Africa
56 Eritrea 18 0 0 0.5 Africa
57 Estonia 224 194 59 9.5 Europe
58 Ethiopia 20 3 0 0.7 Africa
59 Fiji 77 35 1 2.0 Oceania
60 Finland 263 133 97 10.0 Europe
61 France 127 151 370 11.8 Europe
62 Gabon 347 98 59 8.9 Africa
63 Gambia 8 0 1 2.4 Africa
64 Georgia 52 100 149 5.4 Europe
65 Germany 346 117 175 11.3 Europe
66 Ghana 31 3 10 1.8 Africa
67 Greece 133 112 218 8.3 Europe
68 Grenada 199 438 28 11.9 North America
69 Guatemala 53 69 2 2.2 North America
70 Guinea 9 0 2 0.2 Africa
71 Guinea-Bissau 28 31 21 2.5 Africa
72 Guyana 93 302 1 7.1 South America
73 Haiti 1 326 1 5.9 North America
74 Honduras 69 98 2 3.0 North America
75 Hungary 234 215 185 11.3 Europe
76 Iceland 233 61 78 6.6 Europe
77 India 9 114 0 2.2 Asia
78 Indonesia 5 1 0 0.1 Asia
79 Iran 0 0 0 0.0 Asia
80 Iraq 9 3 0 0.2 Asia
81 Ireland 313 118 165 11.4 Europe
82 Israel 63 69 9 2.5 Asia
83 Italy 85 42 237 6.5 Europe
84 Jamaica 82 97 9 3.4 North America
85 Japan 77 202 16 7.0 Asia
86 Jordan 6 21 1 0.5 Asia
87 Kazakhstan 124 246 12 6.8 Asia
88 Kenya 58 22 2 1.8 Africa
89 Kiribati 21 34 1 1.0 Oceania
90 Kuwait 0 0 0 0.0 Asia
91 Kyrgyzstan 31 97 6 2.4 Asia
92 Laos 62 0 123 6.2 Asia
93 Latvia 281 216 62 10.5 Europe
94 Lebanon 20 55 31 1.9 Asia
95 Lesotho 82 29 0 2.8 Africa
96 Liberia 19 152 2 3.1 Africa
97 Libya 0 0 0 0.0 Africa
98 Lithuania 343 244 56 12.9 Europe
99 Luxembourg 236 133 271 11.4 Europe
100 Madagascar 26 15 4 0.8 Africa
101 Malawi 8 11 1 1.5 Africa
102 Malaysia 13 4 0 0.3 Asia
103 Maldives 0 0 0 0.0 Asia
104 Mali 5 1 1 0.6 Africa
105 Malta 149 100 120 6.6 Europe
106 Marshall Islands 0 0 0 0.0 Oceania
107 Mauritania 0 0 0 0.0 Africa
108 Mauritius 98 31 18 2.6 Africa
109 Mexico 238 68 5 5.5 North America
110 Micronesia 62 50 18 2.3 Oceania
111 Monaco 0 0 0 0.0 Europe
112 Mongolia 77 189 8 4.9 Asia
113 Montenegro 31 114 128 4.9 Europe
114 Morocco 12 6 10 0.5 Africa
115 Mozambique 47 18 5 1.3 Africa
116 Myanmar 5 1 0 0.1 Asia
117 Namibia 376 3 1 6.8 Africa
118 Nauru 49 0 8 1.0 Oceania
119 Nepal 5 6 0 0.2 Asia
120 Netherlands 251 88 190 9.4 Europe
121 New Zealand 203 79 175 9.3 Oceania
122 Nicaragua 78 118 1 3.5 North America
123 Niger 3 2 1 0.1 Africa
124 Nigeria 42 5 2 9.1 Africa
125 Niue 188 200 7 7.0 Oceania
126 Norway 169 71 129 6.7 Europe
127 Oman 22 16 1 0.7 Asia
128 Pakistan 0 0 0 0.0 Asia
129 Palau 306 63 23 6.9 Oceania
130 Panama 285 104 18 7.2 North America
131 Papua New Guinea 44 39 1 1.5 Oceania
132 Paraguay 213 117 74 7.3 South America
133 Peru 163 160 21 6.1 South America
134 Philippines 71 186 1 4.6 Asia
135 Poland 343 215 56 10.9 Europe
136 Portugal 194 67 339 11.0 Europe
137 Qatar 1 42 7 0.9 Asia
138 South Korea 140 16 9 9.8 Asia
139 Moldova 109 226 18 6.3 Europe
140 Romania 297 122 167 10.4 Europe
141 Russian Federation 247 326 73 11.5 Asia
142 Rwanda 43 2 0 6.8 Africa
143 St. Kitts & Nevis 194 205 32 7.7 North America
144 St. Lucia 171 315 71 10.1 North America
145 St. Vincent & the Grenadines 120 221 11 6.3 North America
146 Samoa 105 18 24 2.6 Oceania
147 San Marino 0 0 0 0.0 Europe
148 Sao Tome & Principe 56 38 140 4.2 Africa
149 Saudi Arabia 0 5 0 0.1 Asia
150 Senegal 9 1 7 0.3 Africa
151 Serbia 283 131 127 9.6 Europe
152 Seychelles 157 25 51 4.1 Africa
153 Sierra Leone 25 3 2 6.7 Africa
154 Singapore 60 12 11 1.5 Asia
155 Slovakia 196 293 116 11.4 Europe
156 Slovenia 270 51 276 10.6 Europe
157 Solomon Islands 56 11 1 1.2 Oceania
158 Somalia 0 0 0 0.0 Africa
159 South Africa 225 76 81 8.2 Africa
160 Spain 284 157 112 10.0 Europe
161 Sri Lanka 16 104 0 2.2 Asia
162 Sudan 8 13 0 1.7 Africa
163 Suriname 128 178 7 5.6 South America
164 Swaziland 90 2 2 4.7 Africa
165 Sweden 152 60 186 7.2 Europe
166 Switzerland 185 100 280 10.2 Europe
167 Syria 5 35 16 1.0 Asia
168 Tajikistan 2 15 0 0.3 Asia
169 Thailand 99 258 1 6.4 Asia
170 Macedonia 106 27 86 3.9 Europe
171 Timor-Leste 1 1 4 0.1 Asia
172 Togo 36 2 19 1.3 Africa
173 Tonga 36 21 5 1.1 Oceania
174 Trinidad & Tobago 197 156 7 6.4 North America
175 Tunisia 51 3 20 1.3 Africa
176 Turkey 51 22 7 1.4 Asia
177 Turkmenistan 19 71 32 2.2 Asia
178 Tuvalu 6 41 9 1.0 Oceania
179 Uganda 45 9 0 8.3 Africa
180 Ukraine 206 237 45 8.9 Europe
181 United Arab Emirates 16 135 5 2.8 Asia
182 United Kingdom 219 126 195 10.4 Europe
183 Tanzania 36 6 1 5.7 Africa
184 USA 249 158 84 8.7 North America
185 Uruguay 115 35 220 6.6 South America
186 Uzbekistan 25 101 8 2.4 Asia
187 Vanuatu 21 18 11 0.9 Oceania
188 Venezuela 333 100 3 7.7 South America
189 Vietnam 111 2 1 2.0 Asia
190 Yemen 6 0 0 0.1 Asia
191 Zambia 32 19 4 2.5 Africa
192 Zimbabwe 64 18 4 4.7 Africa

In [56]:
pd.reset_option('display.max_rows')
pd.get_option('display.max_rows')


Out[56]:
60

In [64]:
train = pd.read_csv('http://bit.ly/kaggletrain')
train


Out[64]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.25 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.28 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.92 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.10 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.05 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.46 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.86 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.07 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.13 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.07 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.70 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.55 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.05 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.27 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.85 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.00 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.12 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.00 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) female 31.0 1 0 345763 18.00 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.22 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.00 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.00 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.03 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.50 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.07 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson) female 38.0 1 5 347077 31.39 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.22 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.00 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.88 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.90 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.50 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Barron) female 48.0 0 0 17466 25.93 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.55 NaN S
864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.00 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.00 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.86 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.50 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.50 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.13 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.90 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.55 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.00 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.00 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.00 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.22 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.85 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.90 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.90 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.16 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.00 NaN S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.90 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.52 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.50 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.05 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.12 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q

891 rows × 12 columns


In [58]:
pd.get_option('display.max_colwidth')


Out[58]:
50

In [61]:
pd.set_option('display.max_colwidth', 1000)

In [62]:
pd.get_option('display.precision')


Out[62]:
6

In [63]:
pd.set_option('display.precision', 2)

In [66]:
drinks.head()


Out[66]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [70]:
drinks['x'] = drinks.wine_servings * 1000
drinks['y'] = drinks.total_litres_of_pure_alcohol * 1000
drinks.head()


Out[70]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent x y
0 Afghanistan 0 0 0 0.0 Asia 0 0.0
1 Albania 89 132 54 4.9 Europe 54000 4,900.0
2 Algeria 25 0 14 0.7 Africa 14000 700.0
3 Andorra 245 138 312 12.4 Europe 312000 12,400.0
4 Angola 217 57 45 5.9 Africa 45000 5,900.0

In [69]:
pd.set_option('display.float_format', '{:,}'.format)

In [72]:
pd.describe_option('rows')


display.max_info_rows : int or None
    df.info() will usually show null-counts for each column.
    For large frames this can be quite slow. max_info_rows and max_info_cols
    limit this null check only to frames with smaller dimensions than
    specified.
    [default: 1690785] [currently: 1690785]

display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, or
    IDLE do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 60] [currently: 60]



In [73]:
pd.reset_option('all')


height has been deprecated.

line_width has been deprecated, use display.width instead (currently both are
identical)

/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning: 
mpl_style had been deprecated and will be removed in a future version.
Use `matplotlib.pyplot.style.use` instead.

  if __name__ == '__main__':

How do I create a pandas DataFrame from another object?


In [77]:
df = pd.DataFrame({'id': [100, 101, 102], 'color': ['red', 'blue', 'red']}, columns=['id', 'color'], index=['A', 'B', 'C'])

In [79]:
pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'red']], columns=['id', 'color'])


Out[79]:
id color
0 100 red
1 101 blue
2 102 red

In [81]:
import numpy as np
arr = np.random.rand(4, 2)
arr


Out[81]:
array([[ 0.2773193 ,  0.62793445],
       [ 0.48314833,  0.66573051],
       [ 0.96897954,  0.59954869],
       [ 0.33206409,  0.06175922]])

In [84]:
pd.DataFrame(arr, columns=['one', 'two'])


Out[84]:
one two
0 0.277319 0.627934
1 0.483148 0.665731
2 0.968980 0.599549
3 0.332064 0.061759

In [86]:
pd.DataFrame({'student': np.arange(100, 110, 1), 'test': np.random.randint(60, 101, 10)}).set_index('student')


Out[86]:
test
student
100 66
101 85
102 72
103 98
104 82
105 72
106 78
107 82
108 75
109 60

In [90]:
s = pd.Series(['round', 'square'], index=['C', 'B'], name='shape')
s


Out[90]:
C     round
B    square
Name: shape, dtype: object

In [88]:
df


Out[88]:
id color
A 100 red
B 101 blue
C 102 red

In [91]:
pd.concat([df, s], axis=1)


Out[91]:
id color shape
A 100 red NaN
B 101 blue square
C 102 red round

How do I create dummy variables in pandas?


In [108]:
train = pd.read_csv('http://bit.ly/kaggletrain')
train


Out[108]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns


In [93]:
train['Sex_male'] = train.Sex.map({'female': 0, 'male': 1})

In [94]:
train.head()


Out[94]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Sex_male
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 1

In [97]:
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:]


Out[97]:
Sex_male
0 1
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 0
9 0
10 0
11 0
12 1
13 1
14 0
15 0
16 1
17 1
18 0
19 0
20 1
21 1
22 0
23 1
24 0
25 0
26 1
27 1
28 0
29 1
... ...
861 1
862 0
863 0
864 1
865 0
866 0
867 1
868 1
869 1
870 1
871 0
872 1
873 1
874 0
875 0
876 1
877 1
878 1
879 0
880 0
881 1
882 0
883 1
884 1
885 0
886 1
887 0
888 0
889 1
890 1

891 rows × 1 columns


In [99]:
train.Embarked.value_counts()


Out[99]:
S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [107]:
embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]
train = pd.concat([train, embarked_dummies], axis=1)
train.head()


Out[107]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0 1

In [110]:
pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)


Out[110]:
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Sex_male Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN 1 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 PC 17599 71.2833 C85 0 0 0
2 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 0 0 1
4 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN 1 0 1
5 6 0 3 Moran, Mr. James NaN 0 0 330877 8.4583 NaN 1 1 0
6 7 0 1 McCarthy, Mr. Timothy J 54.0 0 0 17463 51.8625 E46 1 0 1
7 8 0 3 Palsson, Master. Gosta Leonard 2.0 3 1 349909 21.0750 NaN 1 0 1
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 27.0 0 2 347742 11.1333 NaN 0 0 1
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) 14.0 1 0 237736 30.0708 NaN 0 0 0
10 11 1 3 Sandstrom, Miss. Marguerite Rut 4.0 1 1 PP 9549 16.7000 G6 0 0 1
11 12 1 1 Bonnell, Miss. Elizabeth 58.0 0 0 113783 26.5500 C103 0 0 1
12 13 0 3 Saundercock, Mr. William Henry 20.0 0 0 A/5. 2151 8.0500 NaN 1 0 1
13 14 0 3 Andersson, Mr. Anders Johan 39.0 1 5 347082 31.2750 NaN 1 0 1
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina 14.0 0 0 350406 7.8542 NaN 0 0 1
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) 55.0 0 0 248706 16.0000 NaN 0 0 1
16 17 0 3 Rice, Master. Eugene 2.0 4 1 382652 29.1250 NaN 1 1 0
17 18 1 2 Williams, Mr. Charles Eugene NaN 0 0 244373 13.0000 NaN 1 0 1
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... 31.0 1 0 345763 18.0000 NaN 0 0 1
19 20 1 3 Masselmani, Mrs. Fatima NaN 0 0 2649 7.2250 NaN 0 0 0
20 21 0 2 Fynney, Mr. Joseph J 35.0 0 0 239865 26.0000 NaN 1 0 1
21 22 1 2 Beesley, Mr. Lawrence 34.0 0 0 248698 13.0000 D56 1 0 1
22 23 1 3 McGowan, Miss. Anna "Annie" 15.0 0 0 330923 8.0292 NaN 0 1 0
23 24 1 1 Sloper, Mr. William Thompson 28.0 0 0 113788 35.5000 A6 1 0 1
24 25 0 3 Palsson, Miss. Torborg Danira 8.0 3 1 349909 21.0750 NaN 0 0 1
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... 38.0 1 5 347077 31.3875 NaN 0 0 1
26 27 0 3 Emir, Mr. Farred Chehab NaN 0 0 2631 7.2250 NaN 1 0 0
27 28 0 1 Fortune, Mr. Charles Alexander 19.0 3 2 19950 263.0000 C23 C25 C27 1 0 1
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" NaN 0 0 330959 7.8792 NaN 0 1 0
29 30 0 3 Todoroff, Mr. Lalio NaN 0 0 349216 7.8958 NaN 1 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward 21.0 1 0 28134 11.5000 NaN 1 0 1
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... 48.0 0 0 17466 25.9292 D17 0 0 1
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" NaN 8 2 CA. 2343 69.5500 NaN 0 0 1
864 865 0 2 Gill, Mr. John William 24.0 0 0 233866 13.0000 NaN 1 0 1
865 866 1 2 Bystrom, Mrs. (Karolina) 42.0 0 0 236852 13.0000 NaN 0 0 1
866 867 1 2 Duran y More, Miss. Asuncion 27.0 1 0 SC/PARIS 2149 13.8583 NaN 0 0 0
867 868 0 1 Roebling, Mr. Washington Augustus II 31.0 0 0 PC 17590 50.4958 A24 1 0 1
868 869 0 3 van Melkebeke, Mr. Philemon NaN 0 0 345777 9.5000 NaN 1 0 1
869 870 1 3 Johnson, Master. Harold Theodor 4.0 1 1 347742 11.1333 NaN 1 0 1
870 871 0 3 Balkic, Mr. Cerin 26.0 0 0 349248 7.8958 NaN 1 0 1
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 47.0 1 1 11751 52.5542 D35 0 0 1
872 873 0 1 Carlsson, Mr. Frans Olof 33.0 0 0 695 5.0000 B51 B53 B55 1 0 1
873 874 0 3 Vander Cruyssen, Mr. Victor 47.0 0 0 345765 9.0000 NaN 1 0 1
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) 28.0 1 0 P/PP 3381 24.0000 NaN 0 0 0
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" 15.0 0 0 2667 7.2250 NaN 0 0 0
876 877 0 3 Gustafsson, Mr. Alfred Ossian 20.0 0 0 7534 9.8458 NaN 1 0 1
877 878 0 3 Petroff, Mr. Nedelio 19.0 0 0 349212 7.8958 NaN 1 0 1
878 879 0 3 Laleff, Mr. Kristo NaN 0 0 349217 7.8958 NaN 1 0 1
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 56.0 0 1 11767 83.1583 C50 0 0 0
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) 25.0 0 1 230433 26.0000 NaN 0 0 1
881 882 0 3 Markun, Mr. Johann 33.0 0 0 349257 7.8958 NaN 1 0 1
882 883 0 3 Dahlberg, Miss. Gerda Ulrika 22.0 0 0 7552 10.5167 NaN 0 0 1
883 884 0 2 Banfield, Mr. Frederick James 28.0 0 0 C.A./SOTON 34068 10.5000 NaN 1 0 1
884 885 0 3 Sutehall, Mr. Henry Jr 25.0 0 0 SOTON/OQ 392076 7.0500 NaN 1 0 1
885 886 0 3 Rice, Mrs. William (Margaret Norton) 39.0 0 5 382652 29.1250 NaN 0 1 0
886 887 0 2 Montvila, Rev. Juozas 27.0 0 0 211536 13.0000 NaN 1 0 1
887 888 1 1 Graham, Miss. Margaret Edith 19.0 0 0 112053 30.0000 B42 0 0 1
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" NaN 1 2 W./C. 6607 23.4500 NaN 0 0 1
889 890 1 1 Behr, Mr. Karl Howell 26.0 0 0 111369 30.0000 C148 1 0 0
890 891 0 3 Dooley, Mr. Patrick 32.0 0 0 370376 7.7500 NaN 1 1 0

891 rows × 13 columns

How do I find and remove duplicate rows in pandas?


In [112]:
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('http://bit.ly/movieusers', delimiter='|', header=None, names=user_cols, index_col='user_id')
users.head()


Out[112]:
age gender occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213

In [113]:
users.shape


Out[113]:
(943, 4)

In [117]:
users.zip_code.duplicated().sum()


Out[117]:
148

In [119]:
users.duplicated().sum()


Out[119]:
7

In [123]:
users.loc[users.duplicated(keep=False), :]


Out[123]:
age gender occupation zip_code
user_id
67 17 M student 60402
85 51 M educator 20003
198 21 F student 55414
350 32 M student 97301
428 28 M student 55414
437 27 F other 20009
460 44 F other 60630
496 21 F student 55414
572 51 M educator 20003
621 17 M student 60402
684 28 M student 55414
733 44 F other 60630
805 27 F other 20009
890 32 M student 97301

In [124]:
users.drop_duplicates(keep='first').shape


Out[124]:
(936, 4)

In [125]:
users.duplicated(subset=['age', 'zip_code']).sum()


Out[125]:
16

How do I make my pandas DataFrame smaller and faster?


In [126]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.head()


Out[126]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [127]:
drinks.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.1+ KB

In [128]:
drinks.info(memory_usage='deep')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 25.9 KB

In [130]:
drinks.memory_usage(deep=True)


Out[130]:
Index                              72
country                         10272
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                       10016
dtype: int64

In [132]:
sorted(drinks.continent.unique())


Out[132]:
['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']

In [133]:
drinks['continent'] = drinks.continent.astype('category')

In [135]:
drinks.dtypes


Out[135]:
country                           object
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object

In [136]:
drinks.continent.cat.codes.head()


Out[136]:
0    1
1    2
2    0
3    2
4    0
dtype: int8

In [137]:
drinks.memory_usage(deep=True)


Out[137]:
Index                              72
country                         10272
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         512
dtype: int64

In [141]:
drinks['country'] = drinks.country.astype('category')
drinks.memory_usage(deep=True)


Out[141]:
Index                              72
country                         10658
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         512
dtype: int64

In [143]:
df = pd.DataFrame({'ID': [100, 101, 102, 103], 'quality': ['good', 'very good', 'good', 'excellent']})
df


Out[143]:
ID quality
0 100 good
1 101 very good
2 102 good
3 103 excellent

How do I select multiple rows and columns from a pandas DataFrame?


In [2]:
ufo = pd.read_csv('http://bit.ly/uforeports')

In [3]:
ufo.head(3)


Out[3]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00

In [4]:
ufo.loc[0, :]


Out[4]:
City                       Ithaca
Colors Reported               NaN
Shape Reported           TRIANGLE
State                          NY
Time               6/1/1930 22:00
Name: 0, dtype: object

In [7]:
ufo.loc[0:2, :]


Out[7]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00

In [11]:
ufo.loc[0, 'City':'State']


Out[11]:
City                 Ithaca
Colors Reported         NaN
Shape Reported     TRIANGLE
State                    NY
Name: 0, dtype: object

In [12]:
ufo.loc[ufo.City=='Oakland', :]


Out[12]:
City Colors Reported Shape Reported State Time
1694 Oakland NaN CIGAR CA 7/21/1968 14:00
2144 Oakland NaN DISK CA 8/19/1971 0:00
4686 Oakland NaN LIGHT MD 6/1/1982 0:00
7293 Oakland NaN LIGHT CA 3/28/1994 17:00
8488 Oakland NaN NaN CA 8/10/1995 21:45
8768 Oakland NaN NaN CA 10/10/1995 22:40
10816 Oakland NaN LIGHT OR 10/1/1997 21:30
10948 Oakland NaN DISK CA 11/14/1997 19:55
11045 Oakland NaN TRIANGLE CA 12/10/1997 1:30
12322 Oakland NaN FIREBALL CA 10/9/1998 19:40
12941 Oakland NaN CYLINDER CA 1/23/1999 21:30
16803 Oakland NaN TRIANGLE MD 7/4/2000 23:00
17322 Oakland NaN CYLINDER CA 9/1/2000 21:35

In [14]:
ufo.iloc[:, [0, 3]]


Out[14]:
City State
0 Ithaca NY
1 Willingboro NJ
2 Holyoke CO
3 Abilene KS
4 New York Worlds Fair NY
5 Valley City ND
6 Crater Lake CA
7 Alma MI
8 Eklutna AK
9 Hubbard OR
10 Fontana CA
11 Waterloo AL
12 Belton SC
13 Keokuk IA
14 Ludington MI
15 Forest Home CA
16 Los Angeles CA
17 Hapeville GA
18 Oneida TN
19 Bering Sea AK
20 Nebraska NE
21 NaN LA
22 NaN LA
23 Owensboro KY
24 Wilderness WV
25 San Diego CA
26 Wilderness WV
27 Clovis NM
28 Los Alamos NM
29 Ft. Duschene UT
... ... ...
18211 Holyoke MA
18212 Carson CA
18213 Pasadena CA
18214 Austin TX
18215 El Campo TX
18216 Garden Grove CA
18217 Berthoud Pass CO
18218 Sisterdale TX
18219 Garden Grove CA
18220 Shasta Lake CA
18221 Franklin NH
18222 Albrightsville PA
18223 Greenville SC
18224 Eufaula OK
18225 Simi Valley CA
18226 San Francisco CA
18227 San Francisco CA
18228 Kingsville TX
18229 Chicago IL
18230 Pismo Beach CA
18231 Pismo Beach CA
18232 Lodi WI
18233 Anchorage AK
18234 Capitola CA
18235 Fountain Hills AZ
18236 Grant Park IL
18237 Spirit Lake IA
18238 Eagle River WI
18239 Eagle River WI
18240 Ybor FL

18241 rows × 2 columns

How do I use pandas with scikit-learn to create Kaggle submissions?


In [38]:
train = pd.read_csv('http://bit.ly/kaggletrain')

In [39]:
train.head()


Out[39]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [40]:
feature_cols = ['Pclass', 'Parch']
X = train.loc[:, feature_cols]

In [41]:
y = train.Survived

In [42]:
from sklearn import linear_model
logreg = linear_model.LogisticRegression()
logreg.fit(X, y)


Out[42]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [46]:
test = pd.read_csv('http://bit.ly/kaggletest')
X_new = test.loc[:, feature_cols]

In [47]:
y_predict = logreg.predict(X_new)

In [48]:
y_predict.shape


Out[48]:
(418,)

In [50]:
pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': y_predict}).set_index('PassengerId').to_csv('sub.csv')

In [51]:
train.to_pickle('train.pkl')

In [52]:
pd.read_pickle('train.pkl')


Out[52]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

How do I use string methods in pandas?


In [53]:
orders = pd.read_table('http://bit.ly/chiporders')
orders.head()


Out[53]:
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98

In [54]:
orders.item_name.str.upper()


Out[54]:
0                CHIPS AND FRESH TOMATO SALSA
1                                        IZZE
2                            NANTUCKET NECTAR
3       CHIPS AND TOMATILLO-GREEN CHILI SALSA
4                                CHICKEN BOWL
5                                CHICKEN BOWL
6                               SIDE OF CHIPS
7                               STEAK BURRITO
8                            STEAK SOFT TACOS
9                               STEAK BURRITO
10                        CHIPS AND GUACAMOLE
11                       CHICKEN CRISPY TACOS
12                         CHICKEN SOFT TACOS
13                               CHICKEN BOWL
14                        CHIPS AND GUACAMOLE
15      CHIPS AND TOMATILLO-GREEN CHILI SALSA
16                            CHICKEN BURRITO
17                            CHICKEN BURRITO
18                                CANNED SODA
19                               CHICKEN BOWL
20                        CHIPS AND GUACAMOLE
21                           BARBACOA BURRITO
22                           NANTUCKET NECTAR
23                            CHICKEN BURRITO
24                                       IZZE
25               CHIPS AND FRESH TOMATO SALSA
26                               CHICKEN BOWL
27                           CARNITAS BURRITO
28                                CANNED SODA
29                            CHICKEN BURRITO
                        ...                  
4592                         BARBACOA BURRITO
4593                            CARNITAS BOWL
4594                            BARBACOA BOWL
4595                             CHICKEN BOWL
4596                      CHIPS AND GUACAMOLE
4597                        CANNED SOFT DRINK
4598                            BOTTLED WATER
4599                             CHICKEN BOWL
4600                      CHIPS AND GUACAMOLE
4601                        CANNED SOFT DRINK
4602                         BARBACOA BURRITO
4603                         BARBACOA BURRITO
4604                             CHICKEN BOWL
4605                      CHIPS AND GUACAMOLE
4606                        CANNED SOFT DRINK
4607                            STEAK BURRITO
4608                           VEGGIE BURRITO
4609                        CANNED SOFT DRINK
4610                            STEAK BURRITO
4611                           VEGGIE BURRITO
4612                            CARNITAS BOWL
4613                                    CHIPS
4614                            BOTTLED WATER
4615                       CHICKEN SOFT TACOS
4616                      CHIPS AND GUACAMOLE
4617                            STEAK BURRITO
4618                            STEAK BURRITO
4619                       CHICKEN SALAD BOWL
4620                       CHICKEN SALAD BOWL
4621                       CHICKEN SALAD BOWL
Name: item_name, dtype: object

In [56]:
orders[orders.item_name.str.contains('Chicken')]


Out[56]:
order_id quantity item_name choice_description item_price
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... $10.98
11 6 1 Chicken Crispy Tacos [Roasted Chili Corn Salsa, [Fajita Vegetables,... $8.75
12 6 1 Chicken Soft Tacos [Roasted Chili Corn Salsa, [Rice, Black Beans,... $8.75
13 7 1 Chicken Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... $11.25
16 8 1 Chicken Burrito [Tomatillo-Green Chili Salsa (Medium), [Pinto ... $8.49
17 9 1 Chicken Burrito [Fresh Tomato Salsa (Mild), [Black Beans, Rice... $8.49
19 10 1 Chicken Bowl [Tomatillo Red Chili Salsa, [Fajita Vegetables... $8.75
23 12 1 Chicken Burrito [[Tomatillo-Green Chili Salsa (Medium), Tomati... $10.98
26 13 1 Chicken Bowl [Roasted Chili Corn Salsa (Medium), [Pinto Bea... $8.49
29 15 1 Chicken Burrito [Tomatillo-Green Chili Salsa (Medium), [Pinto ... $8.49
35 18 1 Chicken Soft Tacos [Roasted Chili Corn Salsa, Rice] $8.75
36 18 1 Chicken Soft Tacos [Roasted Chili Corn Salsa, [Cheese, Lettuce]] $8.75
42 20 1 Chicken Bowl [Roasted Chili Corn Salsa, [Rice, Black Beans,... $11.25
44 20 1 Chicken Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... $8.75
45 21 1 Chicken Burrito [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $10.98
52 24 1 Chicken Burrito [Roasted Chili Corn Salsa (Medium), [Black Bea... $10.98
63 28 1 Chicken Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Cheese]] $8.75
68 30 1 Chicken Burrito [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $10.98
73 33 1 Chicken Burrito [Tomatillo Red Chili Salsa, [Rice, Black Beans... $8.75
75 34 1 Chicken Soft Tacos [Tomatillo Red Chili Salsa, [Rice, Cheese, Sou... $8.75
76 34 1 Chicken Bowl [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... $8.75
78 34 1 Chicken Bowl [Fresh Tomato Salsa, [Rice, Black Beans, Chees... $8.75
79 35 1 Chicken Soft Tacos [Roasted Chili Corn Salsa, [Rice, Cheese, Lett... $11.25
82 36 1 Chicken Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Chees... $8.75
83 37 1 Chicken Burrito [Tomatillo Green Chili Salsa, [Rice, Black Bea... $8.75
98 43 1 Chicken Burrito [Tomatillo Red Chili Salsa, [Rice, Black Beans... $8.75
99 44 1 Chicken Bowl [Tomatillo Red Chili Salsa, [Rice, Fajita Vege... $8.75
110 49 1 Chicken Bowl [Tomatillo Red Chili Salsa, [Rice, Black Beans... $8.75
114 50 1 Chicken Burrito [Tomatillo-Green Chili Salsa (Medium), [Pinto ... $8.49
... ... ... ... ... ...
4537 1804 1 Chicken Bowl [Fresh Tomato Salsa, [Rice, Black Beans, Chees... $8.75
4540 1805 1 Chicken Bowl [Tomatillo Green Chili Salsa, [Fajita Vegetabl... $11.25
4543 1806 1 Chicken Burrito [Fresh Tomato Salsa, [Fajita Vegetables, Rice]] $8.75
4550 1809 1 Chicken Soft Tacos [Fresh Tomato Salsa, [Fajita Vegetables, Chees... $8.75
4553 1810 1 Chicken Bowl [Roasted Chili Corn Salsa, [Black Beans, Sour ... $11.25
4555 1811 1 Chicken Bowl [Roasted Chili Corn Salsa, [Fajita Vegetables,... $8.75
4556 1811 1 Chicken Burrito [Tomatillo Green Chili Salsa, [Fajita Vegetabl... $11.25
4558 1812 1 Chicken Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Chees... $8.75
4559 1812 1 Chicken Burrito [Tomatillo Red Chili Salsa, [Rice, Cheese, Gua... $11.25
4561 1813 2 Chicken Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... $17.50
4564 1815 1 Chicken Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... $11.25
4566 1816 1 Chicken Burrito [Roasted Chili Corn Salsa, [Pinto Beans, Chees... $8.75
4569 1817 1 Chicken Burrito [Fresh Tomato Salsa, [Fajita Vegetables, Rice]] $8.75
4574 1819 1 Chicken Bowl [Roasted Chili Corn Salsa, [Rice, Cheese, Lett... $8.75
4576 1820 1 Chicken Bowl [Fresh Tomato Salsa, [Rice, Cheese, Sour Cream... $11.25
4580 1822 1 Chicken Salad Bowl [Fresh Tomato Salsa, [Black Beans, Cheese, Gua... $11.25
4581 1822 1 Chicken Bowl [Tomatillo Red Chili Salsa, [Rice, Black Beans... $8.75
4583 1823 1 Chicken Bowl [Tomatillo Green Chili Salsa, [Rice, Black Bea... $8.75
4586 1824 1 Chicken Bowl [Fresh Tomato Salsa, [Rice, Black Beans, Chees... $11.25
4587 1824 1 Chicken Bowl [Fresh Tomato Salsa, [Rice, Black Beans, Chees... $11.25
4589 1825 1 Chicken Bowl [Fresh Tomato Salsa, [Rice, Black Beans, Sour ... $11.25
4590 1825 1 Chicken Bowl [Roasted Chili Corn Salsa, [Rice, Black Beans,... $11.25
4591 1825 1 Chicken Bowl [Tomatillo Red Chili Salsa, [Rice, Black Beans... $8.75
4595 1826 1 Chicken Bowl [Tomatillo Green Chili Salsa, [Rice, Black Bea... $8.75
4599 1827 1 Chicken Bowl [Roasted Chili Corn Salsa, [Cheese, Lettuce]] $8.75
4604 1828 1 Chicken Bowl [Fresh Tomato Salsa, [Rice, Black Beans, Chees... $8.75
4615 1832 1 Chicken Soft Tacos [Fresh Tomato Salsa, [Rice, Cheese, Sour Cream]] $8.75
4619 1834 1 Chicken Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... $11.25
4620 1834 1 Chicken Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... $8.75
4621 1834 1 Chicken Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... $8.75

1560 rows × 5 columns

How do I use the "axis" parameter in pandas?


In [57]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.head()


Out[57]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [59]:
drinks.drop(2, axis=0).head()


Out[59]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
5 Antigua & Barbuda 102 128 45 4.9 North America

In [62]:
drinks.mean(axis=1).head()


Out[62]:
0      0.000
1     69.975
2      9.925
3    176.850
4     81.225
dtype: float64

How do I work with dates and times in pandas?


In [63]:
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.head()


Out[63]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

In [64]:
ufo.dtypes


Out[64]:
City               object
Colors Reported    object
Shape Reported     object
State              object
Time               object
dtype: object

In [69]:
ufo.Time.str.slice(-5, -3).astype(int).head()


Out[69]:
0    22
1    20
2    14
3    13
4    19
Name: Time, dtype: int64

In [70]:
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.dtypes


Out[70]:
City                       object
Colors Reported            object
Shape Reported             object
State                      object
Time               datetime64[ns]
dtype: object

In [71]:
ufo.head()


Out[71]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 1930-06-01 22:00:00
1 Willingboro NaN OTHER NJ 1930-06-30 20:00:00
2 Holyoke NaN OVAL CO 1931-02-15 14:00:00
3 Abilene NaN DISK KS 1931-06-01 13:00:00
4 New York Worlds Fair NaN LIGHT NY 1933-04-18 19:00:00

In [75]:
ufo.Time.dt.dayofyear.head()


Out[75]:
0    152
1    181
2     46
3    152
4    108
Name: Time, dtype: int64

In [76]:
ts = pd.to_datetime('1/1/1999')

In [79]:
ufo.loc[ufo.Time > ts, :].head()


Out[79]:
City Colors Reported Shape Reported State Time
12832 Loma Rica NaN LIGHT CA 1999-01-01 02:30:00
12833 Bauxite NaN NaN AR 1999-01-01 03:00:00
12834 Florence NaN CYLINDER SC 1999-01-01 14:00:00
12835 Lake Henshaw NaN CIGAR CA 1999-01-01 15:00:00
12836 Wilmington Island NaN LIGHT GA 1999-01-01 17:15:00

In [80]:
%matplotlib inline

In [81]:
ufo['Year'] = ufo.Time.dt.year
ufo.head()


Out[81]:
City Colors Reported Shape Reported State Time Year
0 Ithaca NaN TRIANGLE NY 1930-06-01 22:00:00 1930
1 Willingboro NaN OTHER NJ 1930-06-30 20:00:00 1930
2 Holyoke NaN OVAL CO 1931-02-15 14:00:00 1931
3 Abilene NaN DISK KS 1931-06-01 13:00:00 1931
4 New York Worlds Fair NaN LIGHT NY 1933-04-18 19:00:00 1933

In [85]:
ufo.Year.value_counts().sort_index().plot()


Out[85]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d082d10>

In [86]:
ufo.sample(3)


Out[86]:
City Colors Reported Shape Reported State Time Year
17939 Greenup NaN FIREBALL KY 2000-11-14 21:00:00 2000
4164 Wenatchee NaN TRIANGLE WA 1979-08-15 22:00:00 1979
9884 Millsboro NaN TRIANGLE DE 1997-01-15 18:00:00 1997

In [88]:
ufo.sample(frac=0.001)


Out[88]:
City Colors Reported Shape Reported State Time Year
2983 Shoreline NaN EGG WA 1975-07-13 19:30:00 1975
14529 Cotton NaN TRIANGLE MN 1999-09-10 22:00:00 1999
5330 Pendleton NaN RECTANGLE OR 1985-09-01 13:30:00 1985
10508 St. Louis BLUE LIGHT MO 1997-07-04 21:15:00 1997
5241 Route 7 NaN RECTANGLE NY 1985-06-15 02:30:00 1985
16331 Plano NaN TRIANGLE TX 2000-04-27 23:38:00 2000
17198 Beaumont NaN NaN TX 2000-08-15 17:00:00 2000
7942 Everett NaN NaN WA 1995-03-22 21:30:00 1995
4934 Grand Prairie NaN NaN TX 1983-07-15 18:00:00 1983
9257 Langley NaN NaN WA 1996-04-26 03:30:00 1996
9402 Carlsbad NaN DISK NM 1996-06-15 12:00:00 1996
16322 San Rafael NaN LIGHT CA 2000-04-25 23:30:00 2000
44 Blairsden GREEN SPHERE CA 1946-06-30 19:00:00 1946
9834 Poulsbo NaN NaN WA 1996-12-21 00:00:00 1996
9216 Phillipsburg NaN NaN PA 1996-04-12 23:55:00 1996
3983 North Branch NaN RECTANGLE MN 1978-11-21 23:45:00 1978
7158 Bellevue NaN SPHERE WA 1993-08-10 20:30:00 1993
7029 Topeka NaN DISK KS 1993-05-22 20:50:00 1993

What do I need to know about the pandas index?


In [89]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.head()


Out[89]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [90]:
drinks.index


Out[90]:
RangeIndex(start=0, stop=193, step=1)

In [91]:
drinks.loc[23, :]


Out[91]:
country                                Brazil
beer_servings                             245
spirit_servings                           145
wine_servings                              16
total_litres_of_pure_alcohol              7.2
continent                       South America
Name: 23, dtype: object

In [92]:
drinks.set_index('country', inplace=True)

In [93]:
drinks.head()


Out[93]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
country
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa

In [94]:
drinks.loc['Brazil', 'beer_servings']


Out[94]:
245

In [95]:
drinks.index.name = None

In [96]:
drinks.head()


Out[96]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa

In [97]:
drinks.reset_index(inplace=True)

In [98]:
drinks.head()


Out[98]:
index beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [99]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.head()


Out[99]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [101]:
drinks.set_index('country', inplace=True)
drinks.head()


Out[101]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
country
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa

In [103]:
drinks.continent.head()


Out[103]:
country
Afghanistan      Asia
Albania        Europe
Algeria        Africa
Andorra        Europe
Angola         Africa
Name: continent, dtype: object

In [105]:
drinks.continent.value_counts()['Africa']


Out[105]:
53

In [106]:
drinks.continent.value_counts().sort_index().head()


Out[106]:
Africa           53
Asia             44
Europe           45
North America    23
Oceania          16
Name: continent, dtype: int64

In [107]:
people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')
people


Out[107]:
Albania    3000000
Andorra      85000
Name: population, dtype: int64

In [109]:
drinks.beer_servings * people


Out[109]:
Afghanistan                     NaN
Albania                 267000000.0
Algeria                         NaN
Andorra                  20825000.0
Angola                          NaN
Antigua & Barbuda               NaN
Argentina                       NaN
Armenia                         NaN
Australia                       NaN
Austria                         NaN
Azerbaijan                      NaN
Bahamas                         NaN
Bahrain                         NaN
Bangladesh                      NaN
Barbados                        NaN
Belarus                         NaN
Belgium                         NaN
Belize                          NaN
Benin                           NaN
Bhutan                          NaN
Bolivia                         NaN
Bosnia-Herzegovina              NaN
Botswana                        NaN
Brazil                          NaN
Brunei                          NaN
Bulgaria                        NaN
Burkina Faso                    NaN
Burundi                         NaN
Cabo Verde                      NaN
Cambodia                        NaN
                           ...     
Sudan                           NaN
Suriname                        NaN
Swaziland                       NaN
Sweden                          NaN
Switzerland                     NaN
Syria                           NaN
Tajikistan                      NaN
Tanzania                        NaN
Thailand                        NaN
Timor-Leste                     NaN
Togo                            NaN
Tonga                           NaN
Trinidad & Tobago               NaN
Tunisia                         NaN
Turkey                          NaN
Turkmenistan                    NaN
Tuvalu                          NaN
USA                             NaN
Uganda                          NaN
Ukraine                         NaN
United Arab Emirates            NaN
United Kingdom                  NaN
Uruguay                         NaN
Uzbekistan                      NaN
Vanuatu                         NaN
Venezuela                       NaN
Vietnam                         NaN
Yemen                           NaN
Zambia                          NaN
Zimbabwe                        NaN
dtype: float64

In [110]:
pd.concat([drinks, people], axis=1).head()


Out[110]:
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent population
Afghanistan 0 0 0 0.0 Asia NaN
Albania 89 132 54 4.9 Europe 3000000.0
Algeria 25 0 14 0.7 Africa NaN
Andorra 245 138 312 12.4 Europe 85000.0
Angola 217 57 45 5.9 Africa NaN

When should I use a "groupby" in pandas?


In [111]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.head()


Out[111]:
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

In [112]:
drinks.beer_servings.mean()


Out[112]:
106.16062176165804

In [114]:
drinks.groupby('continent').beer_servings.mean()


Out[114]:
continent
Africa            61.471698
Asia              37.045455
Europe           193.777778
North America    145.434783
Oceania           89.687500
South America    175.083333
Name: beer_servings, dtype: float64

In [116]:
drinks[drinks.continent=='Africa'].beer_servings.mean()


Out[116]:
61.471698113207545

In [117]:
drinks.groupby('continent').beer_servings.agg(['count', 'min', 'max', 'mean'])


Out[117]:
count min max mean
continent
Africa 53 0 376 61.471698
Asia 44 0 247 37.045455
Europe 45 0 361 193.777778
North America 23 1 285 145.434783
Oceania 16 0 306 89.687500
South America 12 93 333 175.083333

In [118]:
%matplotlib inline

In [121]:
drinks.groupby('continent').mean().plot(kind='bar')


Out[121]:
<matplotlib.axes._subplots.AxesSubplot at 0x10de95190>

When should I use "inplace" parameter in pandas?


In [126]:
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.head()


Out[126]:
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00

In [127]:
ufo.drop('City', axis=1, inplace=True)

In [128]:
ufo.head()


Out[128]:
Colors Reported Shape Reported State Time
0 NaN TRIANGLE NY 6/1/1930 22:00
1 NaN OTHER NJ 6/30/1930 20:00
2 NaN OVAL CO 2/15/1931 14:00
3 NaN DISK KS 6/1/1931 13:00
4 NaN LIGHT NY 4/18/1933 19:00

In [130]:
ufo.dropna(how='any').shape


Out[130]:
(2490, 4)

In [131]:
ufo.shape


Out[131]:
(18241, 4)